Abstract

Formality is one of the most important di- 
mensions of writing style variation. In this 
study we conducted an inter-rater reliabil- 
ity experiment for assessing sentence for- 
mality on a five-point Likert scale, and ob- 
tained good agreement results as well as 
different rating distributions for different 
sentence categories. We also performed 
a difficulty analysis to identify the bottle- 
necks of our rating procedure. Our main 
objective is to design an automatic scor- 
ing mechanism for sentence-level formal- 
ity, and this study is important for that pur- 
pose. 

1 Introduction 

Formality of language is an important dimension
of writing style variation (Biber, 1988; Heylighen
and Dewaele, 1999). Academic papers are usually
written more formally than blog posts, while
blog posts are usually written more formally than
forum threads (Lahiri et al., 2011). The concept
of formality has so far been explored at three
different levels - the document level (Heylighen
and Dewaele, 1999), the word level (Brooke et
al., 2010), and the sentence level (Lahiri et al.,
2011). All these studies have directly or indirectly
shown that formality is a rather subjective concept,
and there exists a continuum of formality so that 
linguistic units (e.g., a word, a sentence or a docu- 
ment) may never be classified as "fully formal" or 
"fully informal", but they should rather be rated 
on a scale of formality. For example, consider 
the following three sentences: "Howdy!!", "How r
u?" and "How are you?". Note that each sentence
is more formal than the previous one, and the
formalization process can be continued forever.
Heylighen and Dewaele (1999) in their seminal work
on document formality have explained this issue 
by defining two different variants of formality - 
surface and deep. The surface variant formal- 
izes language for no specific purpose other than 
stylistic embellishment, but the deep variant for- 
malizes language for communicating the meaning 
more clearly and completely. More complete com- 
munication of meaning involves context-addition, 
which can be continued ad infinitum, thereby re- 
sulting in sentences that are always more (deeply) 
formal than the last one. Heylighen and Dewaele 
also discussed the use of formality to obscure 
meaning (e.g., by politicians), but it was treated 
as a corruption of the original usage. 

Heylighen and Dewaele's quantification of deep 
formality is not as reliable when we look into the 
sub-document level. At the word level, a very
different approach for dealing with the issue of
formality has been proposed by Brooke et al. (2010).
They experimented with several word-level
formality scores to determine the one that best
associated with hand-crafted seed sets of formal and
informal words, as well as words co-occurring with
the seed sets. Lahiri et al. (2011) explored the
concept of sentence-level formality from two dif- 
ferent perspectives - deep formality of annotated 
and un-annotated sentence corpora, and inherent 
agreement between two judges on an annotation 
task. They observed that the deep formality of 
sentences broadly followed the corpus-level trend, 
and correlated well with human annotation. It was 
also reported that when the annotation judgment 
was binary (i.e., formal vs informal sentence) and 
no prior instructions were given to the annotators 
as to what constitutes a formal sentence, there was 
very poor inter-annotator agreement, which in turn
showed how inherently subjective the concept of
formality is. 

Our work is a direct extension of the
inter-annotator agreement study reported by Lahiri et
al. (2011). Instead of binary annotation
(formal/informal sentence), we adopted a 1-5 Likert
scale, where 1 represents a very informal sen- 
tence and 5 a very formal sentence. Keeping 
prior instructions to a minimum, we observed that
the inherent agreement results using the Likert scale
were better than the results using binary
annotation. This observation supports the presence of
a formality continuum at the sentence level. It also
helped us construct a seed set of sentences with 
human-assigned formality ratings. This seed set 
can be used in evaluating an automatic scoring 
mechanism for sentence-level formality. Note that 
adding up word-level scores is not appropriate for 
this purpose, because it may so happen that all the 
words in a sentence are formal, but the sentence as 
a whole is not so formal (e.g., "For all the stars in 
the sky, I do not care."). 

This paper is organized as follows. In Section 2
we explain the design of our study and its
rationale. Section 3 gives the experimental results and
difficulty analysis. We conclude in Section 4,
outlining our contributions.

2 Study Design 

We adopted a five-point Likert scale for the
formality annotation of sentences. The 1-5 scale is
easily interpretable, widely used and well-suited 
for ordinal ratings. The annotators were requested 
to rate each sentence as follows: 1 - Very Infor- 
mal, 2 - Informal, 3 - In-between, 4 - Formal, 5 - 
Very Formal, X - Not Sure. The annotators were 
not given any instructions as to what constitutes 
a very formal sentence, what constitutes a very in- 
formal sentence, etc. They were, however, advised 
to keep in mind that the ratings were relative to 
each other, and were requested to be consistent in 
their ratings, and rate sentences independently. 

We conducted the inter-rater agreement study in 
two phases. In the warm-up (pilot) phase, we gave 
100 sentences to the raters, and observed if they 
were able to do the ratings on their own, and if the 
agreement was good or not. Then we proceeded 
to the actual annotation phase with 500 sentences. 
The difference between these two phases was that 
in the warm-up phase, the raters sat together in 
our presence, working independently and getting
a feel for the task. We, however, did not provide
any instructions on how to rate the sentences, and 
the raters were completely on their own. In the ac- 
tual phase, the raters worked separately and in our 
absence. 

Two raters participated in this study. Both were 
female undergraduate sophomore students, and 
both were native English speakers at least 18 years
of age. The raters were selected randomly from a 
pool of respondents who emailed us their consent 
to participate in this study. The warm-up phase 
took less than an hour, and the actual phase took 
approximately one and a half hours. 

The sentences were selected from the four 
datasets used in (Lahiri et al., 2011). For the
warm-up set, we randomly picked 25 sentences 
from each category (blog, news, forum and pa- 
per), and for the actual set, we randomly picked 
125 sentences from each category. The warm- 
up set and the actual set were mutually exclusive, 
and sentences in each set were scrambled so that 
(a) raters did not know which sentence falls under 
which category, and (b) raters were not influenced 
by the original ordering of sentences. 
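
The sampling procedure described above can be sketched as follows. This is only an illustration in Python (our analyses used MATLAB); the function name `draw_rating_sets`, the `seed` parameter, and the toy corpus are assumptions made for the example and are not part of the original study.

```python
import random

CATEGORIES = ["blog", "news", "forum", "paper"]

def draw_rating_sets(corpus, n_warmup=25, n_actual=125, seed=0):
    """corpus maps category -> list of sentences.

    Returns two shuffled lists of (category, sentence) pairs that
    share no sentence, as long as sentences within a category are unique.
    """
    rng = random.Random(seed)
    warmup, actual = [], []
    for category in CATEGORIES:
        # One draw without replacement per category keeps the sets disjoint.
        picked = rng.sample(corpus[category], n_warmup + n_actual)
        warmup += [(category, s) for s in picked[:n_warmup]]
        actual += [(category, s) for s in picked[n_warmup:]]
    # Scramble so raters see neither the category blocks nor the source order.
    rng.shuffle(warmup)
    rng.shuffle(actual)
    return warmup, actual

# Toy corpus with placeholder sentences (hypothetical data).
toy = {c: [f"{c} sentence {i}" for i in range(200)] for c in CATEGORIES}
warmup_set, actual_set = draw_rating_sets(toy)
print(len(warmup_set), len(actual_set))
```

The category label is kept alongside each sentence only for later analysis; it would be hidden from the raters.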

3 Results 

We performed three types of analysis on the
warm-up as well as on the actual set of sentence ratings.^1
The first type attempts to find out the agreement 
and correlation between the two raters, and how 
similar the ratings were. The second type of anal- 
ysis explores the properties of rating distributions 
and whether distributions for different categories 
of sentences (i.e., blog, forum, news or paper) are 
different. The third type of analysis deals with two 
kinds of difficult sentences and their relative fre- 
quencies. The two kinds of difficult sentences are 
X-marked sentences and sentences for which the 
raters differed by two or more points. 

3.1 Agreement and Correlation 

We report four nonparametric correlation
coefficients between the two raters, as well as cosine and
Tanimoto similarity (Tanimoto, 1957) between the
two rating vectors.^2 Each element in a rating
vector corresponds to a sentence and the value of the
element is the formality rating of the sentence. We
also report Cohen's κ (Cohen, 1960) and
Krippendorff's α (Krippendorff, 2007) for
measuring quantitative agreement between the two raters.
These results were obtained after pruning the
X-marked sentences. Table 1 shows the results for
the actual set. Overall results (the rightmost
column) indicate that the cosine and Tanimoto
similarity between the raters were fairly high, which
shows that the rating directions were preserved. In
other words, if rater A rated sentence S1 as more
formal than sentence S2, then rater B also rated S1
as more formal than S2, not the other way round.
This shows the consistency of our raters and the
importance of the Likert scale in formality judgment.
High similarity values were also obtained within
specific categories (forum, blog, news and paper
sentences), showing that rating consistency was
maintained across categories. Similar results were
obtained for the warm-up set as well.

Table 1: Agreement and correlation values on the actual set.

^1 Code and data available at http://www.personal.psu.edu/sxl382/
^2 We used MATLAB for all our analyses.
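
As an illustration of the two similarity measures, the following Python sketch computes them for a pair of rating vectors. The function names and the five toy ratings are assumptions made for this example; the Tanimoto similarity is taken in its usual real-valued form, a·b / (|a|² + |b|² - a·b).

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two rating vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def tanimoto_similarity(a, b):
    """Real-valued Tanimoto similarity: a.b / (|a|^2 + |b|^2 - a.b)."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sum(x * x for x in a) + sum(y * y for y in b) - dot)

# Hypothetical ratings of five sentences by two raters (1-5 scale).
rater_a = [1, 2, 4, 5, 3]
rater_b = [2, 2, 3, 5, 3]

print(round(cosine_similarity(rater_a, rater_b), 4))
print(round(tanimoto_similarity(rater_a, rater_b), 4))
```

Both measures equal 1 when the two vectors are identical.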

Correlation between the two raters was measured
with four non-parametric tests - the γ-test (Goodman
and Kruskal, 1954), Kendall's τ_a and
τ_b (Kendall, 1938), and Spearman's ρ. The γ-test
and τ_b are particularly well suited for measuring
similarity between ordinal ratings, because they
emphasize the number of concordant pairs over
the number of discordant pairs. We obtained a
fairly high value for the overall γ for both the
actual and the warm-up set, thereby showing good
inherent agreement between annotators. Values
for Kendall's τ_a and τ_b, and Spearman's ρ were
not as high, but they were all found to be
statistically significant (i.e., significantly different from
0) with p-value < 0.05. Only for the "paper"
category, the p-values were found to be > 0.05 for γ,
Spearman's ρ, and Kendall's τ_a. For the warm-up
set, p-values were found to be > 0.05 for
Spearman's ρ and Kendall's τ_a under the "blog"
category. All others were statistically significant. Note
that the p-values for Kendall's τ_b, Krippendorff's
α and the γ-test are one-tailed and computed by
bootstrapping (1000 bootstrap samples) under the null
hypothesis that the observed correlation is 0.
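
The role of concordant and discordant pairs can be made concrete with a short Python sketch of Goodman and Kruskal's γ and Kendall's τ_a. The function names and toy ratings are assumptions made for this example. Pairs tied on either rater are ignored by γ but still count in the denominator of τ_a, which is one plausible reason for γ coming out higher than τ_a on a five-point scale with many ties.

```python
from itertools import combinations

def pair_counts(a, b):
    """Count concordant and discordant sentence pairs; ties count as neither."""
    concordant = discordant = 0
    for (a1, b1), (a2, b2) in combinations(zip(a, b), 2):
        sign = (a1 - a2) * (b1 - b2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    return concordant, discordant

def goodman_kruskal_gamma(a, b):
    """(C - D) / (C + D); undefined if every pair is tied."""
    c, d = pair_counts(a, b)
    return (c - d) / (c + d)

def kendall_tau_a(a, b):
    """(C - D) divided by the total number of pairs, n(n-1)/2."""
    c, d = pair_counts(a, b)
    n = len(a)
    return (c - d) / (n * (n - 1) / 2)

# Hypothetical ratings of five sentences by two raters (1-5 scale).
rater_a = [1, 2, 4, 5, 3]
rater_b = [2, 2, 3, 5, 3]

print(pair_counts(rater_a, rater_b))            # (8, 0): two of ten pairs are tied
print(goodman_kruskal_gamma(rater_a, rater_b))  # 1.0
print(kendall_tau_a(rater_a, rater_b))          # 0.8
```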

Inter-rater reliability was measured with
Cohen's κ and Krippendorff's α. Justification for
using the latter is given in (Artstein and
Poesio, 2008). When category labels are not equally
distinct from one another (as is our case),
Krippendorff's α must be computed. The values
are reported in Table 1. Note that
Krippendorff's α allows missing data as well, so we could
have incorporated the X-marked sentences in the
α-computation. But to avoid complication, we chose
not to do so, and quarantined the X-marked
sentences for further analysis. Observe from Table 1
that although the category-wise κ-values indicate
slight or no agreement (Landis and Koch, 1977),
the overall κ-value for the actual set indicates fair
agreement. This is a significant achievement given
the conservativeness of κ, the subjectivity
associated with formality judgment, our small dataset,
and no prior instructions on what to consider
formal and what to consider informal. This result
is better than the one reported in (Lahiri et al.,
2011) (κ_Blog = 0.164, κ_News = 0.019), which shows
the merit of Likert-scale annotation for formality
judgment. The overall κ-values were found to be
statistically significant with p-value < 0.005.
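
For reference, unweighted Cohen's κ can be sketched in Python as below. The function name and toy ratings are assumptions made for this example. Unweighted κ treats a 4-versus-5 disagreement exactly like a 1-versus-5 disagreement, which illustrates why a measure that accounts for the distance between labels, such as Krippendorff's α, is preferable for Likert ratings.

```python
from collections import Counter

def cohen_kappa(a, b):
    """Unweighted Cohen's kappa: (p_o - p_e) / (1 - p_e)."""
    n = len(a)
    # Observed agreement: fraction of sentences given the same rating.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each rater's own label frequencies.
    freq_a, freq_b = Counter(a), Counter(b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical ratings of five sentences by two raters (1-5 scale).
rater_a = [1, 2, 4, 5, 3]
rater_b = [2, 2, 3, 5, 3]

print(round(cohen_kappa(rater_a, rater_b), 3))  # 0.5
```

Here the raters agree on three of five sentences (0.6) against a chance agreement of 0.2, giving κ = 0.4 / 0.8 = 0.5.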

3.2 Rating Distributions 

The distributions of sentence formality ratings
(Figure 1) for the actual set indicate that Rater 1
tended to rate sentences more formally on average
than Rater 2 (same conclusion from paired t-test
and U test (Mann and Whitney, 1947) with 95%
confidence). Figure 1 shows that the two raters
rated almost the same number of sentences as
either 1 or 3. In other words, the number of very
informal as well as "in-between-type" sentences
appears to be consistent across the two raters. But Rater
1 considered a large number of sentences "formal"
(i.e., rating 4), whereas Rater 2 considered an
almost equally large number of sentences "informal"
(i.e., rating 2). On the other hand, relatively fewer
sentences were considered "very formal" or "very
informal". One possible reason for this behavior is
the so-called "central tendency bias",^3 which we
consider a limitation of our study.

Figure 1: Sentence formality rating histograms for
the actual set (mean rating and sd in parentheses).

To determine if the rating distributions under
different categories (blog, forum, news and
paper) were significantly different from each other,
we performed the non-parametric Kruskal-Wallis
test (Kruskal and Wallis, 1952). For both raters
and for both actual and warm-up sets, the results
indicated that at least one category differed from
others in formality rating. The non-parametric
U test on category pairs (with Bonferroni
correction (Dunn, 1961) for multiple comparison)
showed the formality ratings under each category
to be significantly different from others (95%
confidence). Only in the warm-up set, the blog and
news ratings were not found to be significantly
different for either of the raters. We also performed
a Kolmogorov-Smirnov test (Smirnov, 1948) to
see if the distributions were significantly different
from each other. For the warm-up set, the results
followed the U test, although for one rater, blog and
forum sentence ratings were not found to be
significantly different. For the actual set, for one rater
blog and news sentence ratings were not found to
be significantly different.



^3 See, for example, http://en.wikipedia.org/wiki/Likert_scale



Following the U Test results, we note that the 
category-wise sentence formality rating distribu- 
tions were significantly different from each other, 
and the general trend of mean and median rat- 
ings followed the intuition that the "paper" cate- 
gory sentences are more formal than the "blog" 
and "news" categories, which in turn are more for- 
mal than the "forum" category. 
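
The arithmetic behind a Bonferroni correction over category pairs is sketched below in Python. The sketch assumes one U test per unordered pair of the four categories for a given rater and set, and a family-wise level of 0.05; the helper `is_significant` and the example p-values are hypothetical.

```python
from itertools import combinations

categories = ["blog", "forum", "news", "paper"]

# Every unordered pair of categories gets one pairwise test.
pairs = list(combinations(categories, 2))

family_alpha = 0.05
# Bonferroni: divide the family-wise level by the number of comparisons.
corrected_alpha = family_alpha / len(pairs)

def is_significant(p_value):
    """Compare a single pairwise p-value against the corrected threshold."""
    return p_value < corrected_alpha

print(len(pairs), round(corrected_alpha, 4))       # 6 0.0083
print(is_significant(0.02), is_significant(0.001)) # False True
```

Under these assumptions a pairwise p-value of 0.02, which would pass an uncorrected 0.05 threshold, no longer counts as significant.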

3.3 Difficulty Analysis 

There were 25 X-marked sentences in the ac- 
tual set (5%), and six in the warm-up set (6%). 
These sentences represent confusing cases that at 
least one rater marked as "X". These are primar- 
ily system error and warning messages, program- 
ming language statements, incomplete sentences, 
and two sentences merged into one. The last two 
types of sentences arose because of imprecise sen- 
tence segmentation. A manual cleaning to remove 
such cases from the original datasets seemed pro- 
hibitively time-consuming. Many of these sen- 
tences are from the "paper" category. 

The second type of difficulty concerns the sen- 
tences for which the annotators differed by two or 
more points. There were 40 such cases in the ac- 
tual set, and 7 cases in the warm-up set. These sen- 
tences were either too long, or too short, or gram- 
matically inconsistent. Many of them were in- 
complete sentences, or two sentences merged into 
one. Note that since we did not provide the anno- 
tators with a detailed guideline on what to consider 
formal and what informal, they freely interpreted 
the too-long, too-short and grammatically incon- 
sistent sentences according to their own formality 
judgment. This is precisely where the subjectivity 
in their judgments kicked in. However, such cases 
were never a majority. 
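
The two kinds of difficult sentences can be identified mechanically from the two rating lists, as in the following Python sketch. The function name `difficulty_report`, the `min_gap` parameter, and the toy ratings are assumptions made for this example.

```python
def difficulty_report(ratings_a, ratings_b, min_gap=2):
    """Return indices of X-marked sentences and of large disagreements.

    A rating is an integer from 1 to 5, or the string "X" for "Not Sure".
    """
    x_marked, large_gap = [], []
    for i, (a, b) in enumerate(zip(ratings_a, ratings_b)):
        if a == "X" or b == "X":
            # At least one rater was unsure about this sentence.
            x_marked.append(i)
        elif abs(a - b) >= min_gap:
            # Both raters gave a rating, but they differ by two or more points.
            large_gap.append(i)
    return x_marked, large_gap

# Hypothetical ratings of five sentences by two raters.
x_idx, gap_idx = difficulty_report([1, "X", 4, 5, 2], [3, 2, 4, "X", 3])
print(x_idx, gap_idx)  # [1, 3] [0]
```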

4 Conclusion 

In this paper we reported an inter-rater agreement 
study for assessing sentence formality on a
five-point Likert scale. We obtained consistent agreement
values on a set of 500 sentences, better than those
previously reported for binary annotation.
Sentences from different categories (blog, forum, 
news and paper) were shown to follow different 
formality rating distributions. We also performed 
a difficulty analysis to identify problematic sen- 
tences, and as a by-product of our study, we ob- 
tained a seed set of human-annotated sentences 
that can later be used in evaluating an automatic 
scoring mechanism for sentence-level formality. 